actionGrammar: compile-time grammar optimizer (inline + prefix factoring)#2247
Merged
curtisman merged 16 commits intomicrosoft:mainfrom Apr 24, 2026
Merged
actionGrammar: compile-time grammar optimizer (inline + prefix factoring)#2247curtisman merged 16 commits intomicrosoft:mainfrom
curtisman merged 16 commits intomicrosoft:mainfrom
Conversation
Adds an optimizer pipeline (`optimizeGrammar`) gated by per-grammar
options, currently exposing two passes:
- inlineSingleAlternatives: collapse single-alternative RulesParts
into their parent, removing a layer of backtracking nesting in
the matcher. Handles child.value via a flat 3-case decision —
substitute into parent.value, hoist onto a single-part parent
without its own value, or drop when unobservable — and
propagates captured variables onto direct-capture parts when
the wrapper carries a binding.
- factorCommonPrefixes: factor common leading parts (including
partial leading string tokens) shared by alternatives within
a RulesPart, with conservative refusals on binding collisions,
cross-scope value references, mixed value-presence, and
multi-part suffixes that would lose the matcher's single-part
default-value rule.
Both passes preserve shared-array identity via an identity memo
so the dedup invariant in grammarSerializer.ts holds. Tests
cover the spacing-mode tightening, value substitution/hoist/drop
cases, and the prefix-factoring guards.
…scripts - Move grammarOptimizerBenchmark and grammarOptimizerSyntheticBenchmark from test/*.spec.ts into src/bench/ as plain node entry points. - Extract shared CONFIGS, runBenchmark, timing, and colored-speedup helpers into src/bench/benchUtil.ts. - Color the speedup column with chalk: green when >1.10x, red when <0.90x, plain otherwise. - Add 'bench', 'bench:synthetic', and 'bench:real' npm scripts; exclude dist/bench from the published package.
The inliner spreads a child rule's parts into the parent at the call site. When the child's GrammarRule[] array was referenced from multiple RulesParts (named-rule sharing established by the compiler), inlining at each site duplicated the child's parts, defeating the serializer's identity-based dedup and bloating .ag.json output proportional to the reference count. Add a one-time reference-count pre-pass over the input AST and refuse to inline any RulesPart whose body has more than one incoming reference. Single-reference single-alternative rules continue to be inlined.
The matcher treats top-level alternatives the same way it treats inner RulesPart alternatives — each is queued as its own MatchState and produces its own GrammarMatchResult. So factoring shared prefixes across top-level rules is semantically safe (cardinality and per-rule values are preserved via the existing __opt_factor capture). After nested factoring completes, wrap the top-level Grammar.rules in a synthetic RulesPart and reuse factorRulesPart with the same fixed-point iteration used for nested groups. This destroys the 1:1 correspondence between top-level rule indices and the original source. That mapping must be recovered via separate metadata if anything downstream depends on it. Also fuse the map+some pass in inlineRulesArray and factorRulesArray into a single loop that only allocates a new array once an element actually changes.
The loop existed defensively but the current grouping/intersection logic converges in a single pass: groups are seeded from rules that pairwise fail the share check, so each group's canonical prefix can't share with any other group's prefix on a second pass. Newly synthesized suffix RulesParts are intentionally not re-walked, and single-rule wrappers hit the early return. Call factorRulesPart once at each site; remove factorRulesPartToFixedPoint.
Replaces the pairwise-shape-comparison factoring with an incremental trie: rules are inserted edge-by-edge (per-token for StringPart; conservative match by typeName / array identity / matcherName for other parts), and the trie is then walked post-order to emit factored rules. Variables on wildcard / number / rules edges are canonicalized to the first inserter's name, with later inserters' value expressions remapped at emission. Behavioral notes: - Suffix sub-prefixes are now factored (e.g. 'play song x | play song y | play album z' factors both 'play' and inner 'song'). - A failed eligibility check at one fork now triggers a local bailout (members are flattened with their prefix prepended) instead of refusing to factor the whole group, allowing factoring above the failing fork. - Strengthened the wholly-consumed check to refuse empty-parts members regardless of value-presence (the matcher's default-value resolver cannot handle 'parts:[], value: undefined' inside a wrapped RulesPart). Tests: existing 90 optimizer tests pass; added 10 new tests covering the suffix-sharing win and 9 risk categories (cross-scope bailout, RulesPart array-identity preservation, strict-prefix overlap with mixed values, multi-level factoring, variable canonicalization including object-shorthand, interleaved match-order preservation, nested wrapper-variable scoping).
- Split TrieRoot from TrieNode so non-root edges are non-optional, removing four non-null assertions and the redundant edge guard. - Collapse single-line edgeKeyMatches branches; keep block scope only for the variants that read multiple fields. - Use items.flatMap(it => it.rules) at the two emission sites.
Allocate fresh '__opt_v_<n>' canonical names for every variable-bearing edge in the factoring trie instead of using the first inserter's user- supplied variable name. Eliminates two collision classes: - Outer-name shadow: under first-inserter-wins, a non-lead alternative whose value referenced an outer free variable matching the lead's local binding name would silently alias the lead's binding. Opaque canonicals cannot collide with any user name. - Bound vs. unbound RulesPart references: parity check in edgeKeyMatches keeps '<Inner>' and '$(v:<Inner>)' as separate trie children rather than silently merging (which would either invent or drop a binding). Other changes: - Split TrieStep (yielded by partsToEdgeSteps with 'local') from TrieEdge (stored in trie with 'canonical'). - Lead inserter now records its remap too (previously was a no-op). - recordStepRemap throws on conflicting overwrite of the same local to two different canonicals (defensive; unreachable in well-formed input). - Removed binding-shadow check in checkFactoringEligible (impossible with opaque canonicals); kept cross-scope-ref check, now expressed in canonical-name terms — it remains necessary because nested rule scope is fresh in the matcher (entering RulesPart resets valueIds). Tests: optimizer suite 100 \u2192 102; new tests cover outer-name shadow and bound-vs-unbound parity. Full action-grammar suite 2322 pass.
…e walks - Inline pass: unconditionally α-rename child rule top-level bindings to fresh __opt_inline_<n> names (per-parent counter), eliminating collisions in both substitute and drop branches. - Unify Substitute and Drop branches in tryInlineRulesPart; Hoist remains a separate fast path. - Batch per-parent value substitutions into a single Map and a single AST walk in inlineRule. - Unify remapValueVariables and substituteValueVariable into substituteValueVariables(node, Map<string, CompiledValueNode>); remapValueVariables is now a thin wrapper. - Add tests for branch-3 collision rename and per-parent counter uniqueness.
- inlineSingleAlternatives: handle multi-part parents that capture a value-producing child via part.variable by hoisting child.value into a synthesized valueAssignment (matches the matcher's default-value rule). Drop withPropagatedVariable / findExistingVariable in favor of in-place re-targeting of the single binding-friendly child part. - Tests: add coverage for optional-flag preservation, spacing-mode mismatches, refCount>1 inline refusal, bound-nested-rules-part regression, plus factoring eligibility and trie-edge variants.
- factorRulesPart split into thin RulesPart wrapper + flat factorRules core; top-level factoring now reuses factorRules directly instead of wrapping Grammar.rules in a synthetic RulesPart. - Trie children switched from TrieNode[] (linear edgeKeyMatches scan) to Map<string, TrieNode> keyed by stepMergeKey for O(1) insertion. GrammarRule[] array identity preserved via a per-BuildState WeakMap interner so the rules-edge merge key stays primitive without losing array-identity sharing. Removed now-unused edgeKeyMatches. - emitFromNode prefix construction made linear: appendPart (returned a fresh array per step → O(depth²)) replaced with in-place appendPartInPlace. - factorCommonPrefixes JSDoc warns that top-level factoring destroys the 1:1 source-rule index correspondence. - Inliner second-pass comment corrected: factoring never emits single-alternative wrappers itself; the second inline pass exists to collapse sub-RulesParts inside emitted suffixes that became inline-eligible. - BuildState JSDoc clarifies its scope and contrasts with RenameState. - bench/grammarOptimizerBenchmark.ts: comment documents the dist/bench path-relative grammar-file assumption. - README adds an 'Optimizer benchmarks' section explaining the build-then-run flow for pnpm run bench:synthetic / bench:real. - docs/architecture/actionGrammar.md: rewrote the Compile-time optimizations section to match current code (Hoist/Substitute/Drop sub-strategies, opaque __opt_v_/__opt_inline_ canonicals, parity check, local bailout, no fixed-point loop, top-level index caveat, factorRules vs. factorRulesPart, Map-keyed children, src/bench/ standalone scripts). - Test consolidation: merged grammarOptimizerFactoringRepro.spec.ts and grammarOptimizerTrieRisks.spec.ts into grammarOptimizerFactoring.spec.ts (28 tests preserved exactly; 5 spec files → 3). - Full action-grammar suite passes (2339 tests); prettier clean.
Adds tests for previously uncovered paths in grammarOptimizer.ts: - grammarOptimizerValueExpressions.spec.ts: exercises every CompiledValueNode arm in substituteValueVariables (inliner Substitute branch) and collectVariableReferences (factorer cross-scope-ref check) -- array, binary/unary/conditional/member/call expressions, spread element, template literal, object spread, object shorthand-with-sub. - grammarOptimizerFactoring.spec.ts: number-edge factoring (with and without optional flag) and wrapper-rule spacingMode propagation (positive + negative). - grammarOptimizerIntegration.spec.ts: optimizeGrammar early-return paths, inliner refusal when child has multiple variable-bearing parts, compiler skipping optimization on grammars with errors, and observable effect of the post-factor inline pass. grammarOptimizer.ts coverage: 86.18% -> 96.73% lines, 75.45% -> 90% branches, 98% -> 100% functions. Tests: 119 -> 142, all passing.
- Fix correctness bug in factorCommonPrefixes: bail out at any fork whose members all lack a value expression. The matcher's implicit text-default only fires for single-StringPart rules, so a wrapper rule [prefix..., suffixRulesPart] with no value would throw 'missing value for default' at finalize. Caught by the new fuzzer. - Unify factor-pass naming: wrapper bindings now use freshWrapperBinding threaded through BuildState; reserved-set scan removed. - Defer-allocate in factorParts to match factorRulesArray style. - Cleaner onlyChild (no non-null chain). - Drop duplicate JSDoc on BuildState. - Add recommendedOptimizations preset (and re-export from index). - Add grammarOptimizerFuzz.spec.ts: deterministic random grammars + inputs, asserts matchGrammar agrees with and without optimizations.
Default the agc compile command to recommendedOptimizations (matches what runtime callers should use). Add a --debug flag that disables optimizations, producing an unoptimized AST that preserves the 1:1 correspondence between top-level rules and the original source for diagnostics.
The cross-scope-ref check in checkFactoringEligible only compared member values against the immediate prefix at the current trie fork. It missed the case where a deeper bailout had already prepended an inner edge to each member, dragging in references to ancestor-prefix canonicals (variables bound several edges up the trie). An outer fork at a literal edge would then factor those members into a fresh wrapper-rule scope where the ancestor canonical was no longer visible, producing a runtime "Internal error: No value for variable '__opt_v_*'" This surfaced on playerSchema.agr's play <TrackPhrase> by <ArtistName> (alt1) play <TrackPhrase> from album <AlbumName> (alt2) play <TrackPhrase> by <ArtistName> from album ... (alt3) trie shape: the inner <ArtistName> fork bails (alt1's terminal lands with empty parts), prepending <ArtistName> to alt3's suffix; the outer 'by' fork's eligibility check then sees no canonicals in its 'by' prefix and factored anyway, lifting alt1/alt3 into a wrapper where <TrackPhrase>'s canonical was lost. Fix: tighten the check to require every member's value to only reference variables bound in *that member's own parts*. This subsumes the simpler immediate-prefix check and catches the bailout-then-factor scenario. prefix parameter is no longer needed. Add a regression test covering the exact playerSchema trie shape.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a compile-time grammar optimizer to
actionGrammarthat reduces NFA/DFA size by inlining single-alternative rules and factoring common prefixes across alternatives, plus a generalization that factors common prefixes across top-level rules. Optimizations are now enabled by default inactionGrammarCompiler'scompilecommand (with a new--debugflag to disable for diagnostics).What's in here
grammarOptimizer.ts) with three composable passes:__opt_inline_<n>names to avoid collisions. Hoist / Substitute / Drop sub-strategies handle value-producing vs. valueless children.RulesPart, via a trie keyed by part-merge keys. Variable-bearing edges get fresh opaque__opt_v_<n>canonical names, eliminating outer-name shadowing and first-inserter-wins collisions.recommendedOptimizationspreset re-exported from the package index for runtime callers.agc compiledefaults torecommendedOptimizations;--debugdisables them and preserves the unoptimized AST for diagnostics.Correctness
__opt_v_*,__opt_inline_*) cannot collide with any user-supplied variable name.grammarOptimizerFuzz.spec.ts) generates random grammars + inputs and assertsmatchGrammaragrees with and without optimizations.Tests
Adds ~2300 cases across 6 spec files covering inlining, factoring, sharing, value expressions, integration, and fuzzing.
grammarOptimizer.tsline coverage is 96.7%, branch coverage 90%, function coverage 100%. Full action-grammar suite passes; prettier clean.Benchmarks
Benchmarks (
src/bench/grammarOptimizer*Benchmark.ts) are standalone scripts (not jest), runnable viapnpm run bench:synthetic/bench:realafter build. README has an "Optimizer benchmarks" section.Docs
docs/architecture/actionGrammar.md— new "Compile-time optimizations" section describing strategies, opaque canonicals, parity check, top-level-index caveat, and bench layout.packages/actionGrammar/README.md— optimizer overview and benchmark instructions.